October 2020

Lecture 3: plotting

Part 1: Why do we plot?

Why do we plot

Why do we want to plot data?

  • we are human beings – we are pattern recognizers
  • we can see things we are not able to grasp from raw numbers
  • good to explore a dataset and look for regularities
  • good to convey a clear message
  • to have fun
  • (to show your colleagues how nice your plot is)

The importance of plotting: eyeballing the data

Looking at the data as a first step of analysis is always a good idea

  • data could look similar at a first glance
  • and even have similar descriptive statistics (e.g. mean, variance)
  • but still be very different in practice

An example: the Datasaurus

A striking example of this is the “Datasaurus dozen”: a dull and unimpressive dataset.

  • the data contain two variables, x and y, for 13 different datasets
  • import the data (it is in data/DatasaurusDozen.tsv) and compute the mean of x and y by dataset
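library(tidyverse)  # read_tsv(), %>%, group_by(), summarise()
library(knitr)      # kable()

# read the tab-separated file, then compute the mean of x and y per dataset
df <- read_tsv("data/DatasaurusDozen.tsv")
df %>%
  group_by(dataset) %>%
  summarise(mean_x = round(mean(x), 2), mean_y = round(mean(y), 2)) %>%
  kable()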
dataset     mean_x  mean_y
away         54.27   47.83
bullseye     54.27   47.83
circle       54.27   47.84
dino         54.26   47.83
dots         54.26   47.84
h_lines      54.26   47.83
high_lines   54.27   47.84
slant_down   54.27   47.84
slant_up     54.27   47.83
star         54.27   47.84
v_lines      54.27   47.84
wide_lines   54.27   47.83
x_shape      54.26   47.84

Datasaurus, plotted

But if you plot it, you’ll see stark differences
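
The same effect can be reproduced without the course data file. The sketch below assumes ggplot2 is installed and swaps in R's built-in anscombe data (Anscombe's quartet: four small datasets, not the Datasaurus itself). The names long, means and plot_sets are illustrative only.

```r
library(ggplot2)

# anscombe is a built-in data frame with columns x1..x4 and y1..y4;
# stack the four x/y pairs into one long table with a dataset label
long <- data.frame(
  dataset = rep(paste0("set", 1:4), each = nrow(anscombe)),
  x = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
  y = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4)
)

# the means of x and y are (almost) identical across the four sets
means <- aggregate(cbind(x, y) ~ dataset, data = long, FUN = mean)
print(means)

# one panel per dataset: the shapes are clearly different
plot_sets <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~dataset)
plot_sets
```

With the lecture file loaded as df, the same pattern would presumably be ggplot(df, aes(x, y)) + geom_point() + facet_wrap(~dataset).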

The importance of plotting: compact information

Plotting allows one to convey a lot of information in a compact way

  • humans are pattern recognizers
  • several geometric objects can convey meaning
    • position (x,y)
    • color
    • size
    • shape
  • you can combine multiple plots to create infographics (cool!)

Good plots, bad plots

  • It is important to make good plots
  • i.e., plots that look good
  • …and are honest to the data

  • it is very easy to hide the message rather than highlighting it
  • it is very easy to mislead with a plot
  • so let’s start with a gallery of bad plots. Can you guess why they are bad?

Bad plotting 1

Bad plotting 2

Bad plotting 3

Bad plotting 4

Bad plotting 5

Bad plotting 5 (really, you don’t need 3D plots)

The road to good plotting

  • know your data
  • think before you hit the enter button
  • sketch on paper first
  • be honest
  • draw your axes first
  • choose your visualization wisely
  • a good plot gives lots of precise information in a concise way.

Examples:

Good plots, 1

Good plots, 2

Good plots, 3

Good plots, 4

Plotting with ggplot

Some data

We will start with mpg, a dataset that ships with ggplot2

library(tidyverse)  # loads ggplot2 (which provides mpg), dplyr and the pipe
mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows

A look at the data

  • manufacturer : manufacturer name
  • model : model name
  • displ : engine displacement, in litres
  • year : year of manufacture
  • cyl : number of cylinders
  • trans : type of transmission
  • drv : f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
  • cty : city miles per gallon
  • hwy : highway miles per gallon
  • fl : fuel type
  • class : “type” of car

A look at the data

skimr::skim(mpg)
Data summary
Name mpg
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
character 6
numeric 5
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
manufacturer           0              1    4   10      0        15           0
model                  0              1    2   22      0        38           0
trans                  0              1    8   10      0        10           0
drv                    0              1    1    1      0         3           0
fl                     0              1    1    1      0         5           0
class                  0              1    3   10      0         7           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean    sd      p0     p25     p50     p75  p100  hist
displ                  0              1     3.47  1.29     1.6     2.4     3.3     4.6     7  ▇▆▆▃▁
year                   0              1  2003.50  4.51  1999.0  1999.0  2003.5  2008.0  2008  ▇▁▁▁▇
cyl                    0              1     5.89  1.61     4.0     4.0     6.0     8.0     8  ▇▁▇▁▇
cty                    0              1    16.86  4.26     9.0    14.0    17.0    19.0    35  ▆▇▃▁▁
hwy                    0              1    23.44  5.95    12.0    18.0    24.0    27.0    44  ▅▅▇▁▁

We will be using ggplot2. Why?

Advantages of ggplot2

  • consistent underlying grammar of graphics (Wilkinson, 2005)
  • plot specification at a high level of abstraction
  • very flexible
  • mature and complete graphics system
  • theme system for polishing plot appearance
  • many users and an active, fast, competent support community

What is a grammar of graphics?

The basic idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:

  • data
  • aesthetic mapping
  • geometric object
  • statistical transformations

  • scales
  • coordinate system
  • position adjustments
  • faceting
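
To make the list concrete, here is one plot in which each building block appears once. It is a sketch that assumes ggplot2 is installed and uses its bundled mpg data. The particular choices (jitter, a linear smoother, the axis label, the y limits) are illustrative.

```r
library(ggplot2)

blocks <- ggplot(mpg, aes(x = displ, y = hwy)) +            # data + aesthetic mapping
  geom_point(position = "jitter") +                         # geometric object + position adjustment
  stat_smooth(method = "lm", formula = y ~ x) +             # statistical transformation
  scale_x_continuous(name = "Engine displacement (l)") +    # scale
  coord_cartesian(ylim = c(10, 45)) +                       # coordinate system
  facet_wrap(~drv)                                          # faceting

blocks
```

Each block sits on its own line, so any one of them can be swapped or dropped without touching the others.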

Starting from the basics

Just as the minimal sentence in a grammar is a subject, the minimal object in a plot is the data

ggplot(mpg)

Basics

In a grammar, you need a verb. In plots, these are the axes

p <- ggplot(mpg, aes(x = displ, y = hwy))
p

Still no data drawn: only the empty axes appear!

Generating a plot

But you also need an object. In ggplot, these are the geoms

p + geom_point()

Generating a plot, 2

But you also need an object. In ggplot, these are the geoms

p + geom_smooth()

Generating a plot, 3

You can add (+) as many geoms as you wish

p + geom_smooth()+geom_point()

The beauty of a grammar metaphor

  • once you get the main idea, adding things is easy
  • a plot is a sentence made with data
  • you add layers with +
  • as you would add words to a sentence
  • just as adjectives give more nuanced meaning in a grammar, in plots you can map further aesthetics inside aes(): color, fill, size, shape, etc.

Adding meaning: color

p + geom_point(aes(color=class))

Adding meaning: size

p + geom_point(aes(size=cyl))

Adding meaning: color AND size

p + geom_point(aes(size = cyl, color=class))

Adding meaning: shape

p + geom_point(aes(shape=fl))

Adding meaning: all together (maybe too much)

p + geom_point(aes(color=manufacturer, shape =fl, size = cyl))

Recap so far

  • ggplot works like a grammar
  • start with ggplot()
  • first argument is “the subject”, i.e. data: ggplot(df, ...)
  • then you map variables to aesthetics (x, y, color, fill, shape, size, …)
  • ggplot(df, aes(dimension = variable))
  • then you add (+) meaning with geometric objects: geom_*
  • + geom_line()
  • notes:
    • geoms inherit the aes of the plot if not specified
    • all variables mapped to aes vary with the data
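
The two notes above can be checked directly. The sketch below assumes ggplot2 is installed; the object names mapped, fixed and override are illustrative. It contrasts an aesthetic mapped inside aes(), a constant set outside aes(), and a layer that overrides an inherited mapping.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy))

# mapped inside aes(): colour varies with the data (one colour per class)
mapped <- p + geom_point(aes(color = class))

# set outside aes(): a constant, identical for every point
fixed <- p + geom_point(color = "blue")

# the layer inherits x from the plot but overrides y with its own mapping
override <- p + geom_point(aes(y = cty))

mapped
fixed
override
```

A common slip is writing aes(color = "blue"): that maps the constant string "blue" as if it were a variable, so the points get a default palette colour and a legend instead of turning blue.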

Facets

  • sometimes sentences become a bit too long
  • it is useful to split them up in shorter sentences
  • for instance, you could first talk about a car, then another one
  • in plots, you can split up the plot along a variable
  • so that one plot is drawn for each level of a given variable, say type of fuel

Facets

p + geom_point(aes(color=manufacturer, size = cyl))+facet_grid(.~fl)
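
The facet formula reads rows ~ columns, so .~fl means "no row split, one column per fuel type". The sketch below assumes ggplot2 is installed (the names grid and wrapped are illustrative). It shows a split along two variables and the facet_wrap() alternative for a single variable with many levels.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy))

# facet_grid(rows ~ columns): drive type in rows, fuel type in columns
grid <- p + geom_point() + facet_grid(drv ~ fl)

# facet_wrap(): panels of one variable, wrapped over several rows
wrapped <- p + geom_point() + facet_wrap(~class)

grid
wrapped
```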

More details on the grammar

A ggplot is made up of

  • data (subject)
  • axes (verb)
  • geoms (object)
  • aesthetics (size, fill, color, shape, label, …)
  • facets (splitting sentences)

And then you can change how things look and behave:

  • coordinate functions (changing the axis appearance and type)
  • scale functions (changing how data values map to aesthetics such as color or size)
  • theme functions (changing the appearance of the plot itself)
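
Here is one function of each kind applied to the same scatter plot. The sketch assumes ggplot2 is installed; the palette, limits, theme and labels are arbitrary illustrative choices.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

polished <- base +
  scale_color_brewer(palette = "Set1") +   # scale: how drv values become colours
  coord_cartesian(xlim = c(2, 6)) +        # coordinates: zoom in without dropping data
  theme_minimal() +                        # theme: non-data appearance of the plot
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")

polished
```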

Exploring data with plots: one variable

Plot types depend on the variable type

  • one-variable plots, discrete variable: barplot
  • one-variable plots, continuous variable: distribution, density

Barplots

  • let’s look at the drive type of the cars: front, rear, or 4wd
p <- ggplot(mpg, aes(drv))
p + geom_bar()

Barplots

  • not so fancy. should we add color?
p <- ggplot(mpg, aes(drv))
p + geom_bar(aes(color=drv))

Barplots

  • Oops: color only changes the outline of the bars. Maybe we meant fill?
p <- ggplot(mpg, aes(drv))
p + geom_bar(aes(fill=drv))

Barplots

  • what if we cross it with another variable?
p <- ggplot(mpg, aes(drv))
p + geom_bar(aes(fill=class))

Barplots

  • By default stacked. How to unstack?
p <- ggplot(mpg, aes(drv))
p + geom_bar(aes(fill=class), position = position_dodge())

Barplots

  • By default stacked. How to show relative weight?
p <- ggplot(mpg, aes(drv))
p + geom_bar(aes(fill=class), position = position_fill())

One variable, continuous: mpg on highway

  • When the variable is continuous, it makes more sense to show distributions
p <- ggplot(mpg, aes(hwy))
p + geom_histogram()

Histograms: number of bins

p + geom_histogram(bins = 10)

Histograms: number of bins

p + geom_histogram(bins = 100)
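
The bins argument fixes how many bins are drawn; binwidth fixes their width in data units instead. The sketch below assumes ggplot2 is installed (by_bins and by_width are illustrative names), and the width of 2 miles per gallon is an arbitrary choice.

```r
library(ggplot2)

p <- ggplot(mpg, aes(hwy))

# fix the number of bins
by_bins <- p + geom_histogram(bins = 10)

# fix the width of each bin instead (here: 2 miles per gallon)
by_width <- p + geom_histogram(binwidth = 2)

by_bins
by_width
```

A width in data units is often easier to explain to a reader than a bin count.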

An alternative to histogram: dotplot

p + geom_dotplot(binwidth = 0.5)

Continuous distribution: Kernel Density Estimation

p + geom_density()

Continuous distribution: Kernel Density Estimation

p + geom_density(adjust = 3)

Continuous distribution: Kernel Density Estimation

p + geom_density(adjust = 0.5)
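
The adjust argument multiplies the default bandwidth: values below 1 give a wigglier curve, values above 1 a smoother one. The sketch below assumes ggplot2 is installed and overlays the three settings in one plot. The legend labels are plain strings chosen for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(hwy))

# one density layer per bandwidth multiplier; the constant strings inside
# aes() only serve to create a legend entry for each curve
densities <- p +
  geom_density(aes(color = "adjust = 0.5"), adjust = 0.5) +
  geom_density(aes(color = "adjust = 1"), adjust = 1) +
  geom_density(aes(color = "adjust = 3"), adjust = 3) +
  labs(color = "bandwidth")

densities
```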

Exploring data with plots: two variables

Plot types depend on the variable type

  • both vars continuous: scatter, smooth
  • one continuous, one discrete: columns (i.e., bars), boxplot, violins
  • both discrete: count

Scatter

if two variables are continuous, your choice is scatter

p <- ggplot(mpg, aes(x = cty, y = hwy))
p + geom_point()

Smooth

still, you might just want to show the general tendency

p + geom_smooth()

Scatter + smooth

or both

p + geom_smooth() + geom_point()

Columns: a special type of bars

one variable discrete, the other continuous (note: it needs a summarise())

mpg %>% group_by(manufacturer) %>% summarise(n = n()) %>% 
ggplot(aes(manufacturer, n))+
  geom_col()

Columns: why bother?

the above could have been easily done with geom_bar (that counts for us)

mpg %>% ggplot(aes(manufacturer))+
  geom_bar()

Columns: a special type of bars

but columns give you more options, since now you condition on a proper variable (n). For instance: order by n

mpg %>% group_by(manufacturer) %>% summarise(n = n()) %>% 
ggplot(aes(reorder(manufacturer, -n), n))+
  geom_col()
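
Columns are not limited to counts: any per-group summary can go on the y axis. The sketch below assumes dplyr and ggplot2 are installed (avg_hwy and col_plot are illustrative names). It plots the mean highway mileage per class, ordered from highest to lowest.

```r
library(dplyr)
library(ggplot2)

# one row per class, with the mean highway mileage as a proper variable
avg_hwy <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy))

# reorder() sorts the classes by decreasing mean before plotting
col_plot <- ggplot(avg_hwy, aes(x = reorder(class, -mean_hwy), y = mean_hwy)) +
  geom_col() +
  labs(x = "class", y = "mean highway mpg")

col_plot
```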

Boxplots

boxplots show a distribution but can do so over different levels of a categorical var

mpg %>% ggplot(aes(drv, hwy))+
  geom_boxplot()

An alternative to boxplot: violin

boxplots are bulky and show only summary statistics (median, quartiles, outliers). Want the full distribution? Use violins

mpg %>% ggplot(aes(drv, hwy))+
  geom_violin()

An alternative to boxplot: violin

remember: all is modular. We can always add color, fill…

mpg %>% ggplot(aes(drv, hwy, color = drv, fill = drv))+
  geom_violin()

An alternative to boxplot: violin

remember: all is modular. …facets

mpg %>% ggplot(aes(drv, hwy, color = drv, fill = drv))+
  geom_violin()+
  facet_grid(.~year)
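
Since everything is modular, geoms can also be stacked on the same axes. The sketch below assumes ggplot2 is installed; the widths and transparency values are arbitrary. It draws a narrow boxplot and the raw points on top of each violin.

```r
library(ggplot2)

layered <- ggplot(mpg, aes(drv, hwy)) +
  geom_violin(aes(fill = drv), alpha = 0.4) +           # full distribution
  geom_boxplot(width = 0.15) +                          # median and quartiles on top
  geom_jitter(width = 0.1, height = 0, alpha = 0.3)     # raw observations, jittered sideways only

layered
```

Layers are drawn in the order they are added, so the points end up on top.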

Counts

if both variables are categorical, you can count their cross-tabulation

mpg %>% ggplot(aes(fl, drv))+
  geom_count()

Exploring data with plots: three variables

Plot types depend on the variable type

  • all continuous: contour plot (think: elevation in maps)
  • some discrete: tile
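
For the all-continuous case, here is a small contour plot. It assumes ggplot2 is installed and uses faithfuld, a dataset bundled with ggplot2 that holds a 2D density estimate of the Old Faithful eruptions on a regular grid. It is not part of the mpg data used elsewhere in this lecture.

```r
library(ggplot2)

# x and y define the grid, z is the "elevation" whose level lines are drawn
contours <- ggplot(faithfuld, aes(x = waiting, y = eruptions, z = density)) +
  geom_contour()

contours
```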

Tile

two variables define the x,y grid. A third defines the color of the cell. Here: number of cars by year and drive (note: usually requires summarise())

mpg %>% group_by(year, drv) %>% summarise(n = n()) %>% 
  ggplot(aes(x = drv, y = year, fill = n)) +  geom_tile()
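
The fill variable can be any per-cell summary. The sketch below assumes dplyr and ggplot2 are installed (cty_by_cell and tile_plot are illustrative names). It fills each cell with the mean city mileage and treats year as a discrete variable so that the two years form two rows.

```r
library(dplyr)
library(ggplot2)

# one row per (year, drv) cell, with the mean city mileage in that cell
cty_by_cell <- mpg %>%
  group_by(year, drv) %>%
  summarise(mean_cty = mean(cty), .groups = "drop")

# factor(year) makes the y axis discrete: one row of tiles per year
tile_plot <- ggplot(cty_by_cell, aes(x = drv, y = factor(year), fill = mean_cty)) +
  geom_tile() +
  labs(y = "year", fill = "mean city mpg")

tile_plot
```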

Additional resources